Airport delays are one of the most common problems people face when travelling. These delays are often attributed to certain carriers or destinations. The question is whether these really cause the delays, or whether other factors are responsible.
To answer this question, we use data on 336,776 flights that departed in 2013 from New York City's three airports, John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR) and LaGuardia Airport (LGA), to destinations all over the United States and some of its territories (Puerto Rico and the U.S. Virgin Islands). Each flight is described by the 19 attributes listed in Table 1, followed by a sample of the data in Table 2.
The aim of this analysis is to determine whether there is a relationship between New York City's flight delays and the attributes of the dataset. We hypothesize that flight distance and destination are major contributors to the delays, so that delays will differ between destinations at different distances.
| Attribute | Description |
|---|---|
| year | Year of departure. |
| month | Month of departure. |
| day | Day of departure. |
| dep_time | Actual departure time (format HHMM or HMM), local time zone. |
| arr_time | Actual arrival time (format HHMM or HMM), local time zone. |
| sched_dep_time | Scheduled departure time (format HHMM or HMM), local time zone. |
| sched_arr_time | Scheduled arrival time (format HHMM or HMM), local time zone. |
| dep_delay | Departure delay, in minutes. Negative times represent early departures. |
| arr_delay | Arrival delay, in minutes. Negative times represent early arrivals. |
| carrier | Two letter carrier abbreviation. |
| flight | Flight number. |
| tailnum | Plane tail number. |
| origin | Flight origin. |
| dest | Flight destination. |
| air_time | Amount of time spent in the air, in minutes. |
| distance | Distance between airports, in miles. |
| hour | Hour of scheduled departure. |
| minute | Minutes of scheduled departure. |
| time_hour | Scheduled date and hour of the flight as a POSIXct date. |
Table 1: The attributes present in the dataset and their descriptions
Table 2: A sample of the dataset and how it is structured
Now let's take a look at our data:
## year month day dep_time sched_dep_time
## 0 0 0 0 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 0 0 0 0 0
## flight tailnum origin dest air_time
## 0 0 0 0 0
## distance hour minute time_hour
## 0 0 0 0
## Rows: 327,346
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 500
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 905
## Median :2013 Median : 7.000 Median :16.00 Median :1400 Median :1355
## Mean :2013 Mean : 6.565 Mean :15.74 Mean :1349 Mean :1340
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1122 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1554 Median : -5.000
## Mean : 12.56 Mean :1502 Mean :1533 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1944 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## carrier flight tailnum origin
## Length:327346 Min. : 1 Length:327346 Length:327346
## Class :character 1st Qu.: 544 Class :character Class :character
## Mode :character Median :1467 Mode :character Mode :character
## Mean :1943
## 3rd Qu.:3412
## Max. :8500
## dest air_time distance hour
## Length:327346 Min. : 20.0 Min. : 80 Min. : 5.00
## Class :character 1st Qu.: 82.0 1st Qu.: 509 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 888 Median :13.00
## Mean :150.7 Mean :1048 Mean :13.14
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-05 06:00:00.00
## Median :29.00 Median :2013-07-04 09:00:00.00
## Mean :26.23 Mean :2013-07-03 17:56:45.44
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 18:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
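The all-zero counts at the top of this output, together with the 327,346 rows (down from the 336,776 flights mentioned in the introduction), suggest that rows with missing values were removed before the summaries were produced; the report does not show that step. The Python sketch below illustrates the idea on a few made-up rows, with `None` standing in for a missing value. The row values and the helper names `count_missing` and `drop_incomplete` are hypothetical, and the report itself was produced in R.

```python
# Hypothetical rows; None marks a missing value (NA in the original data).
rows = [
    {"dep_delay": 2.0, "arr_delay": 11.0, "air_time": 227.0},
    {"dep_delay": None, "arr_delay": None, "air_time": None},
    {"dep_delay": -1.0, "arr_delay": None, "air_time": None},
    {"dep_delay": 4.0, "arr_delay": 20.0, "air_time": 227.0},
]


def count_missing(data):
    """Return the number of missing values per column."""
    counts = {}
    for row in data:
        for column, value in row.items():
            # (value is None) is a bool, which adds as 0 or 1.
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def drop_incomplete(data):
    """Keep only the rows in which every column has a value."""
    return [row for row in data if all(v is not None for v in row.values())]


complete = drop_incomplete(rows)
print(count_missing(rows))      # missing values per column before cleaning
print(count_missing(complete))  # every count is zero after cleaning
print(len(rows), "->", len(complete))
```

Dropping every incomplete row is the simplest option, but it also discards flights that departed and never recorded an arrival, so the remaining rows describe completed flights only.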
Three airports for one city may seem like a lot, more than many of the world's capitals have. Yet the city that never sleeps hosts only 3 of the 16 airports in the state of New York, which itself does not have the largest number of airports among the states, as shown in Figure 1.
Figure 1: An interactive map showing the number of airports per state. Puerto Rico (7 airports) and U.S. Virgin Islands (2 airports) are not shown.
Based on the figure we can observe that:
The three New York City airports serve different purposes (for example, domestic versus international traffic) and hence should handle different numbers of flights. Let us check that using Figure 2.
Figure 2: Number of flights per NYC airport
From the figure above, we can see that the three airports handle broadly similar numbers of flights. LaGuardia Airport (LGA) has the fewest, probably because it is a predominantly domestic airport. Newark Liberty International Airport (EWR) has the largest share, which may be explained by its location: it lies in New Jersey, just across the state line from New York City, which makes it a strategic location and possibly a more favourable option than JFK International Airport.
Despite the large number of flights and airports around the United States, Figure 3 shows that some states were never reached from NYC airports during 2013.
Figure 3: Map showing flights per state as destination
The map shows that the most visited destination is Florida, followed by California with about half as many flights. The map also shows that 8 states received no flights at all, namely Mississippi, Kansas, Idaho, New Hampshire, New Jersey, Delaware, South Dakota and North Dakota.
Figure 4: Number of delays per NYC airport
The figure features positive and negative points, indicating that a departure or arrival was behind or ahead of schedule, respectively. The average delay across all airports is +9.7 minutes. Although each airport shows a different average, the differences are not large enough to say that a given airport will have a certain delay time.
To further investigate that, let us take a look at the attributes and how they affect each other.
Figure 5: Correlation matrix of the dataset attributes
Here we see the correlation matrix between the main numeric attributes. The pairs worth noting are:

- distance and the delays (dep_delay and arr_delay), whose relationship was the opposite of what we expected.
- air_time (time the airplane spent in the air) and the delays.
- distance and air_time, and, at the bottom of the matrix, the attributes departure time, scheduled departure time, arrival time and scheduled arrival time, which is expected.

Although the correlation matrix showed us the relationships between many variables, it did not cover a very important aspect of flights: time.
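As a rough illustration of how such a correlation matrix can be computed, here is a minimal Python sketch. It assumes the Pearson coefficient, which the report does not state explicitly, and it uses only the first ten rows of the glimpse output shown earlier, so its values will not match Figure 5. The function name `pearson` is illustrative.

```python
import math

# First ten rows of the glimpse output shown earlier.
sample = {
    "dep_delay": [2, 4, 2, -1, -6, -4, -5, -3, -3, -2],
    "arr_delay": [11, 20, 33, -18, -25, 12, 19, -14, -8, 8],
    "air_time": [227, 227, 160, 183, 116, 150, 158, 53, 140, 138],
    "distance": [1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Correlation matrix as a nested dict: matrix[row][column].
matrix = {r: {c: pearson(sample[r], sample[c]) for c in sample} for r in sample}

for name, row in matrix.items():
    print(f"{name:>10}", "  ".join(f"{value:+.2f}" for value in row.values()))
```

Even on this tiny sample, distance and air_time come out strongly positively correlated, which is the expected relationship noted above.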
Figure 6: Departure delays per hour
The plot above indicates that the fewest departure delays occur in the early hours of the day, at 5 am, and late at night, at 10-11 pm. The worst time is between 4 and 7 pm.
Figure 7: Arrival delays per hour
The plot above shows a similar result to the previous one: the fewest arrival delays occur in the early hours of the day, at 5 am, and late at night, at 10-11 pm, and the worst time is between 4 and 7 pm.
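The hourly counts behind plots like these can be obtained by grouping flights on their scheduled departure hour. The sketch below is one possible way to do it in Python, using the first ten rows of the glimpse output; it assumes that any positive dep_delay counts as a delay, which the report does not define, and the function name `delays_per_hour` is illustrative.

```python
from collections import defaultdict

# (sched_dep_time, dep_delay) for the first ten rows of the glimpse output.
flights = [(515, 2), (529, 4), (540, 2), (545, -1), (600, -6),
           (558, -4), (600, -5), (600, -3), (600, -3), (600, -2)]


def delays_per_hour(records):
    """Map each scheduled departure hour to (number of flights, number delayed)."""
    totals = defaultdict(lambda: [0, 0])
    for sched_dep_time, dep_delay in records:
        hour = sched_dep_time // 100  # HHMM or HMM -> hour, e.g. 545 -> 5
        totals[hour][0] += 1
        if dep_delay > 0:  # assumption: any positive delay counts as delayed
            totals[hour][1] += 1
    return {hour: tuple(counts) for hour, counts in sorted(totals.items())}


for hour, (n_flights, n_delayed) in delays_per_hour(flights).items():
    print(f"{hour:02d}:00  flights={n_flights}  delayed={n_delayed}")
```

The same grouping applied to the month attribute instead of the hour would give monthly counts.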
Figure 8: Departure delays per month
Figure 9: Arrival delays per month
We can see from the first plot that the highest numbers of departure delays occur in months 6 (Jun), 7 (Jul) and 12 (Dec), which indicates that there are more departure delays during the summer and winter breaks.
Similarly, in the second plot the highest numbers of arrival delays occur in months 7 (Jul) and 12 (Dec), during the summer and winter breaks.
Figure 10: Number of flights and their departure delays per month
Figure 11: Number of flights and their arrival delays per month
In the first plot, on the left, we can see that as the number of flights increases, the number of departure delays also increases. The months with the highest numbers of delays, 6 (Jun), 7 (Jul) and 12 (Dec), also have the highest numbers of flights compared to other months.
The second plot is similar to the first: as the number of flights increases, the number of arrival delays also increases. The exception is month 8 (Aug), where the number of flights was high but the number of arrival delays was lower than in months with fewer flights.
Figure 12: Percentage of flights’ departure and arrival delays
We can see that almost 39% of the NYC flights in 2013 had a departure delay, only 5% departed exactly on time, and 55.9% departed ahead of schedule. The picture for arrival delays does not differ much from that for departure delays.
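Shares like these follow from the sign of the delay. A minimal Python sketch, assuming that "on time" means a delay of exactly 0 minutes (the report does not spell this out), and using only the first fifteen dep_delay values of the glimpse output, so the resulting percentages differ from Figure 12:

```python
from collections import Counter

# dep_delay values of the first fifteen rows of the glimpse output.
dep_delays = [2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1]


def classify(delay):
    """Label a delay in minutes as 'delayed', 'on time' or 'early'."""
    if delay > 0:
        return "delayed"
    if delay == 0:
        return "on time"  # assumption: on time means exactly 0 minutes
    return "early"


def shares(delays):
    """Percentage of flights in each class."""
    counts = Counter(classify(d) for d in delays)
    return {label: 100 * counts[label] / len(delays)
            for label in ("delayed", "on time", "early")}


print(shares(dep_delays))
```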
| carrier | no_flights | low_delay | medium_delay | high_delay | overall_delay |
|---|---|---|---|---|---|
| UA | 57782 | 26% | 14% | 7% | 47% |
| B6 | 54049 | 17% | 14% | 8% | 40% |
| EV | 51108 | 15% | 16% | 13% | 45% |
| DL | 47658 | 16% | 10% | 6% | 32% |
| AA | 31947 | 16% | 9% | 6% | 32% |
| MQ | 25037 | 11% | 13% | 8% | 32% |
| US | 19831 | 12% | 8% | 4% | 24% |
| 9E | 17294 | 15% | 14% | 11% | 40% |
| WN | 12044 | 27% | 17% | 9% | 54% |
| VX | 5116 | 26% | 10% | 7% | 43% |
| FL | 3175 | 25% | 16% | 10% | 52% |
| AS | 709 | 18% | 7% | 6% | 32% |
| F9 | 681 | 22% | 16% | 11% | 50% |
| YV | 544 | 14% | 14% | 14% | 43% |
| HA | 342 | 13% | 4% | 3% | 20% |
| OO | 29 | 10% | 7% | 14% | 31% |
Table 3: Carriers, their number of flights and the percentage of flights with delays
Carrier WN has the highest percentage of overall delays (54%), while carrier US has a low percentage of delays (24%) relative to its number of flights.
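Table 3 does not state the cut-offs that separate low, medium and high delays. The Python sketch below shows how such a table could be built; the thresholds of 15 and 60 minutes, the sample records and the function names are all hypothetical and chosen purely for illustration.

```python
# Hypothetical cut-offs in minutes; the report does not state the ones it used.
LOW_MAX = 15
MEDIUM_MAX = 60


def delay_category(dep_delay):
    """Return 'none', 'low', 'medium' or 'high' for a departure delay in minutes."""
    if dep_delay <= 0:
        return "none"
    if dep_delay <= LOW_MAX:
        return "low"
    if dep_delay <= MEDIUM_MAX:
        return "medium"
    return "high"


def carrier_breakdown(records):
    """Per carrier: number of flights and rounded percentage per delay category."""
    counts = {}
    for carrier, dep_delay in records:
        per_carrier = counts.setdefault(
            carrier, {"none": 0, "low": 0, "medium": 0, "high": 0})
        per_carrier[delay_category(dep_delay)] += 1
    table = {}
    for carrier, c in counts.items():
        n = sum(c.values())
        row = {"no_flights": n}
        for label in ("low", "medium", "high"):
            row[label + "_delay"] = round(100 * c[label] / n)
        row["overall_delay"] = round(100 * (n - c["none"]) / n)
        table[carrier] = row
    return table


# Made-up records: (carrier, dep_delay in minutes).
records = [("WN", 5), ("WN", 30), ("WN", -2), ("WN", 90),
           ("US", -4), ("US", 0), ("US", 12), ("US", -1)]
print(carrier_breakdown(records))
```

Because each percentage is rounded separately, the three categories need not add up exactly to the overall figure, which would be consistent with the B6 row of Table 3 (17% + 14% + 8% against an overall 40%).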
Figure 13: Carriers, their number of flights and the percentage of flights with delays
From the horizontal stacked bar chart, it is clear that carriers UA, EV, B6 and DL have the highest numbers of delayed flights. These carriers also operate the most flights, which suggests that carriers with a high number of flights tend to have a high number of delays.
| carrier | highest_delay | avg_delay |
|---|---|---|
| F9 | 853 | 20.201175 |
| EV | 548 | 19.838929 |
| YV | 387 | 18.898897 |
| FL | 602 | 18.605984 |
| WN | 471 | 17.661657 |
| 9E | 747 | 16.439574 |
| B6 | 502 | 12.967548 |
| VX | 653 | 12.756646 |
| OO | 154 | 12.586207 |
| UA | 483 | 12.016908 |
| MQ | 1137 | 10.445381 |
| DL | 960 | 9.223950 |
| AA | 1014 | 8.569130 |
| AS | 225 | 5.830748 |
| HA | 1301 | 4.900585 |
| US | 500 | 3.744693 |
Figure 14: Carriers and their maximum and average delays
The carrier with the longest single delay is HA, with a delay of 1301 minutes, and the carrier with the highest average delay is F9, with an average of 20.2 minutes.
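The highest and average delay per carrier can be obtained with a simple group-by. The Python sketch below assumes the table is based on dep_delay (its largest value, 1301, matches the dep_delay maximum in the summary output) and uses only the first nine rows of the glimpse output, so the numbers are not those of the table. The function name `delay_stats` is illustrative.

```python
from statistics import mean

# (carrier, dep_delay) for the first nine rows of the glimpse output.
records = [("UA", 2), ("UA", 4), ("AA", 2), ("B6", -1), ("DL", -6),
           ("UA", -4), ("B6", -5), ("EV", -3), ("B6", -3)]


def delay_stats(rows):
    """Per carrier: (highest delay, average delay), sorted by average, descending."""
    by_carrier = {}
    for carrier, delay in rows:
        by_carrier.setdefault(carrier, []).append(delay)
    stats = {c: (max(d), mean(d)) for c, d in by_carrier.items()}
    return dict(sorted(stats.items(), key=lambda item: item[1][1], reverse=True))


for carrier, (highest, average) in delay_stats(records).items():
    print(f"{carrier}  highest_delay={highest}  avg_delay={average:.2f}")
```

Reporting both values side by side is useful because a single extreme delay, such as HA's, says little about a carrier's typical performance.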
After conducting the EDA, we found that delays do not exceed 15 minutes on average. The sources of the delays vary and are related to multiple factors, notably the date and time of the flight and the carrier.
Wickham H (2022). nycflights13: Flights that Departed NYC in 2013. R package version 1.0.2, https://github.com/hadley/nycflights13.